An Approach to Web-Scale Named-Entity Disambiguation
We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the potential benefits for data-driven approaches of processing larger data-sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
Spatial correlations in attribute communities
Community detection is an important tool for exploring and classifying the
properties of large complex networks and should be of great help for spatial
networks. Indeed, in addition to their location, nodes in spatial networks can
have attributes such as the language for individuals, or any other
socio-economical feature that we would like to identify in communities. We
discuss in this paper a crucial aspect which was not considered in previous
studies which is the possible existence of correlations between space and
attributes. Introducing a simple toy model in which both space and node
attributes are considered, we discuss the effect of space-attribute
correlations on the results of various community detection methods proposed for
spatial networks in this paper and in previous studies. When space is
irrelevant, our model is equivalent to the stochastic block model which has
been shown to display a detectability-non detectability transition. In the
regime where space dominates the link formation process, most methods can fail
to recover the communities, an effect which is particularly marked when
space-attributes correlations are strong. In this latter case, community
detection methods which remove the spatial component of the network can miss a
large part of the community structure and can lead to incorrect results.
Comment: 10 pages and 7 figures
Factors Affecting Web Page Similarity
Abstract. Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is a concept that is central to such tools. In this paper, we examine the effect that content and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect similarities between pages in a different manner. We found a number of factors that, when taken into account, can result in effective measures of similarity between web pages. Query information, in particular, proved to be important for the effective organisation of web pages.
Measuring player’s behaviour change over time in public goods game
An important issue in the public goods game is whether players' behaviour changes over time and, if so, how significant the change is. In this game, players can be classified into different groups according to their level of participation in the public good. This problem can be treated as a concept drift problem by asking how much the clusters of players change over a sequence of game rounds. In this study we present a method for measuring changes in clusters containing the same items over discrete time points, using external clustering validation indices and the area under the curve. External clustering indices were originally used to measure the difference between the clusters suggested by a clustering algorithm and ground truth labels provided by experts. Instead of comparing different cluster labellings, we use these indices to compare the clusters at any two consecutive time points, or at the first time point and each of the remaining time points, and so measure how the clusters differ over time. In theory, any external clustering index can be used to measure changes for any traditional (non-temporal) clustering algorithm, because no single time point carries temporal information on its own. For the public goods game, our results indicate that the players change over time, but the change is smooth and relatively constant between any two time points.
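The comparison described in this abstract can be illustrated with a minimal sketch. It assumes the plain Rand index as the external index and a unit-spaced trapezoidal rule for the area under the curve. The abstract names neither choice, and the function names and the toy player labels below are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two items together,
    or both put them apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def change_curve(snapshots, consecutive=True):
    """Change (1 - Rand index) at each time point.

    consecutive=True compares each snapshot with the previous one;
    consecutive=False compares each snapshot with the first one.
    """
    curve = []
    for t in range(1, len(snapshots)):
        reference = snapshots[t - 1] if consecutive else snapshots[0]
        curve.append(1.0 - rand_index(reference, snapshots[t]))
    return curve


def area_under_curve(values):
    """Trapezoidal area, assuming unit spacing between time points."""
    return sum((a + b) / 2.0 for a, b in zip(values, values[1:]))


# Hypothetical data: cluster label of four players in three game rounds.
snapshots = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
curve = change_curve(snapshots)
total_change = area_under_curve(curve)
```

Because the same items appear at every time point, any pair-counting external index could replace `rand_index` here without changing the rest of the sketch.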
An effective non-parametric method for globally clustering genes from expression profiles
Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, these are usually confronted with difficulties in meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and global optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.
A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences
Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm has a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.
Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
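The membership rule described in the Background (compare each new sequence with cluster representatives and open a new cluster when none is suitable) can be sketched as follows. This is only an illustration under stated assumptions. The k-mer Jaccard distance is a placeholder and is not the paper's grammar-based metric. The threshold value, the choice of the first member as representative, and all function names are hypothetical.

```python
def kmer_set(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Placeholder distance: Jaccard distance between k-mer sets.

    Assumption for illustration only; the paper uses a grammar-based metric.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def greedy_cluster(sequences, threshold, distance=kmer_distance):
    """Assign each sequence to the nearest representative within threshold.

    If no representative is close enough, the sequence starts a new
    cluster and becomes its representative (an assumed convention).
    """
    representatives = []
    clusters = []
    for seq in sequences:
        best_idx, best_dist = None, threshold
        for idx, rep in enumerate(representatives):
            d = distance(seq, rep)
            if d <= best_dist:
                best_idx, best_dist = idx, d
        if best_idx is None:
            representatives.append(seq)
            clusters.append([seq])
        else:
            clusters[best_idx].append(seq)
    return clusters


# Hypothetical toy sequences, not real 16S data.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAATG"]
clusters = greedy_cluster(seqs, threshold=0.5)
```

Passing the metric in through the `distance` parameter keeps the clustering loop independent of the metric, so a different sequence distance could be substituted without touching `greedy_cluster`.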
Clustering daily patterns of human activities in the city
Data mining and statistical learning techniques are powerful analysis tools yet to be incorporated in the domain of urban studies and transportation research. In this work, we analyze an activity-based travel survey conducted in the Chicago metropolitan area over a demographic representative sample of its population. Detailed data on activities by time of day were collected from more than 30,000 individuals (and 10,552 households) who participated in a 1-day or 2-day survey implemented from January 2007 to February 2008. We examine this large-scale data in order to explore three critical issues: (1) the inherent daily activity structure of individuals in a metropolitan area, (2) the variation of individual daily activities, that is, how they grow and fade over time, and (3) clusters of individual behaviors and the revelation of their related socio-demographic information. We find that the population can be clustered into 8 and 7 representative groups according to their activities during weekdays and weekends, respectively. Our results enrich the traditional divisions consisting of only three groups (workers, students and non-workers) and provide clusters based on activities of different time of day. The generated clusters combined with social demographic information provide a new perspective for urban and transportation planning as well as for emergency response and spreading dynamics, by addressing when, where, and how individuals interact with places in metropolitan areas. Massachusetts Institute of Technology. Dept. of Urban Studies and Planning; United States. Dept. of Transportation (Region One University Transportation Center); Singapore-MIT Alliance for Research and Technology
A genetic approach for building different alphabets for peptide and protein classification
Background: In this paper, an optimization approach is proposed for producing reduced alphabets for peptide classification, using a Genetic Algorithm. The classification task is performed by a multi-classifier system where each classifier (Linear or Radial Basis Function Support Vector Machines) is trained using features extracted by different reduced alphabets. Each alphabet is constructed by a Genetic Algorithm whose objective function is the maximization of the area under the ROC curve obtained in several classification problems.
Results: The new approach has been tested in three peptide classification problems: HIV-protease, recognition of T-cell epitopes and prediction of peptides that bind human leukocyte antigens. The tests demonstrate that the idea of training a pool of classifiers by reduced alphabets, created using a Genetic Algorithm, allows an improvement over other state-of-the-art feature extraction methods.
Conclusion: The validity of the novel strategy for creating reduced alphabets is demonstrated by the performance improvement obtained by the proposed approach with respect to other reduced alphabets-based methods in the tested problems.
A Normalized Tree Index for identification of correlated clinical parameters in microarray experiments
Martin C, Tauchen A, Becker A, Nattkemper TW. A Normalized Tree Index for identification of correlated clinical parameters in microarray data. BioData Mining. 2011;4(1): 2.
BACKGROUND:
Measurements at the gene level are widely used to gain new insights into complex diseases, e.g. cancer. A promising approach to understanding basic biological mechanisms is to combine gene expression profiles and classical clinical parameters. However, the computation of a correlation coefficient between high-dimensional data and such parameters is not covered by traditional statistical methods.
METHODS:
We propose a novel index, the Normalized Tree Index (NTI), to compute a correlation coefficient between the clustering result of high-dimensional microarray data and nominal clinical parameters. The NTI detects correlations between hierarchically clustered microarray data and nominal clinical parameters (labels) and gives a measurement of significance in terms of an empirical p-value of the identified correlations. To this end, the microarray data is first clustered by hierarchical agglomerative clustering using standard settings. In a second step, the computed cluster tree is evaluated. For each label, an NTI is computed measuring the correlation between that label and the clustered microarray data.
RESULTS:
The NTI successfully identifies correlated clinical parameters at different levels of significance when applied to two real-world microarray breast cancer data sets. Some of the identified highly correlated labels confirm the current state of knowledge, whereas others help to identify new risk factors and provide a good basis for formulating new hypotheses.
CONCLUSIONS:
The NTI is a valuable tool in the domain of biomedical data analysis. It allows the identification of correlations between high-dimensional data and nominal labels, while at the same time a p-value measures the level of significance of the detected correlations.